Representation Integrity in Temporal Graph Learning Methods
Real-world systems ranging from airline routes to cryptocurrency transfers are naturally modelled as dynamic graphs whose topology changes over time. Conventional benchmarks judge dynamic-graph learners by a handful of task-specific scores, yet seldom ask whether the embeddings themselves remain a truthful, interpretable reflection of the evolving network. We formalize this requirement as representation integrity and derive a family of indexes that measure how closely embedding changes follow graph changes. Three synthetic scenarios (Gradual Merge, Abrupt Move, and Periodic Re-wiring) are used to screen forty-two candidate indexes, and on that basis we recommend one index that passes all of our theoretical and empirical tests. In particular, this validated metric consistently ranks the provably stable UASE and IPP models highest. We then use this index in a comparative study of the representation integrity of common dynamic graph learning models. This study exposes the scenario-specific strengths of neural methods and shows a strong positive rank correlation with one-step link-prediction AUC. The proposed integrity framework therefore offers a task-agnostic and interpretable evaluation tool for dynamic-graph representation quality, providing more explicit guidance for model selection and future architecture design.
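To make the idea of "embedding changes following graph changes" concrete, here is a minimal sketch of one possible index of this kind. It is not the index the authors recommend, which the abstract does not specify. The choice of Frobenius norms for step sizes, the use of Spearman rank correlation, and the names `step_changes` and `toy_integrity_index` are all assumptions made for illustration. The sketch also assumes the embeddings of consecutive snapshots are already aligned in a common coordinate frame.

```python
import numpy as np

def _ranks(values):
    # Ordinal ranks 0..n-1; ties are broken by position (no tie averaging).
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(len(values))
    return ranks

def spearman(a, b):
    # Spearman correlation computed as the Pearson correlation of the ranks.
    ra, rb = _ranks(np.asarray(a)), _ranks(np.asarray(b))
    ra -= ra.mean()
    rb -= rb.mean()
    denom = np.sqrt((ra ** 2).sum() * (rb ** 2).sum())
    return float((ra * rb).sum() / denom)

def step_changes(snapshots):
    # Frobenius norm of the difference between consecutive snapshots.
    return [float(np.linalg.norm(snapshots[t + 1] - snapshots[t]))
            for t in range(len(snapshots) - 1)]

def toy_integrity_index(adjacency_seq, embedding_seq):
    # Hypothetical index: do large graph changes coincide with large
    # embedding changes, in rank order, across time steps?
    return spearman(step_changes(adjacency_seq), step_changes(embedding_seq))
```

A value near 1 would mean that time steps with larger graph changes also show larger embedding changes; a value near -1 would mean the opposite ordering.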
- North America > Canada > Quebec > Montreal (0.14)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.04)
- North America > Trinidad and Tobago > Trinidad > Arima > Arima (0.04)
- Health & Medicine (0.67)
- Banking & Finance > Trading (0.47)
- Information Technology (0.46)
Do Code Models Suffer from the Dunning-Kruger Effect?
Singh, Mukul, Chatterjee, Somya, Radhakrishna, Arjun, Gulwani, Sumit
As artificial intelligence systems increasingly collaborate with humans in creative and technical domains, questions arise about the cognitive boundaries and biases that shape our shared agency. This paper investigates the Dunning-Kruger Effect (DKE), the tendency for those with limited competence to overestimate their abilities, in state-of-the-art LLMs on coding tasks. By analyzing model confidence and performance across a diverse set of programming languages, we reveal that AI models mirror human patterns of overconfidence, especially in unfamiliar or low-resource domains. Our experiments demonstrate that less competent models and those operating in rare programming languages exhibit stronger DKE-like bias, suggesting that the strength of the bias is inversely related to the competence of the models.
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- Europe > Ireland (0.04)
- Asia > Singapore (0.04)
- Asia > Indonesia > Bali (0.04)
- Overview (1.00)
- Research Report > New Finding (0.94)
MathBuddy: A Multimodal System for Affective Math Tutoring
Kar, Debanjana, Böss, Leopold, Braca, Dacia, Dennerlein, Sebastian Maximilian, Hubig, Nina Christine, Wintersberger, Philipp, Hou, Yufang
The rapid adoption of LLM-based conversational systems is already transforming the landscape of educational technology. However, current state-of-the-art learning models do not take the student's affective states into account. Multiple studies in educational psychology support the claim that positive or negative emotional states can impact a student's learning capabilities. To bridge this gap, we present MathBuddy, an emotionally aware LLM-powered Math Tutor, which dynamically models the student's emotions and maps them to relevant pedagogical strategies, making the tutor-student conversation more empathetic. The student's emotions are captured from the conversational text as well as from their facial expressions, and are aggregated from both modalities to confidently prompt our LLM Tutor for an emotionally aware response. We have evaluated our model using automatic evaluation metrics across eight pedagogical dimensions and user studies. We report a 23-point performance gain in win rate and a 3-point overall gain in DAMR scores, which strongly supports our hypothesis that modeling students' emotions improves the pedagogical abilities of LLM-based tutors. Our dataset and code are available at: https://github.com/ITU-NLP/MathBuddy.
- Europe > Austria (0.04)
- North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
- North America > United States > District of Columbia > Washington (0.04)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.93)
- Questionnaire & Opinion Survey (0.89)
- Education > Educational Technology (1.00)
- Information Technology > Security & Privacy (0.67)
- Education > Curriculum > Subject-Specific Education (0.46)
GeoSR: Cognitive-Agentic Framework for Probing Geospatial Knowledge Boundaries via Iterative Self-Refinement
Tang, Jinfan, Wu, Kunming, Gongxie, Ruifeng, He, Yuya, Wu, Yuankai
Recent studies have extended the application of large language models (LLMs) to geographic problems, revealing surprising geospatial competence even without explicit spatial supervision. However, LLMs still face challenges in spatial consistency, multi-hop reasoning, and geographic bias. To address these issues, we propose GeoSR, a self-refining agentic reasoning framework that embeds core geographic principles -- most notably Tobler's First Law of Geography -- into an iterative prediction loop. In GeoSR, the reasoning process is decomposed into three collaborating agents: (1) a variable-selection agent that selects relevant covariates from the same location; (2) a point-selection agent that chooses reference predictions at nearby locations generated by the LLM in previous rounds; and (3) a refine agent that coordinates the iterative refinement process by evaluating prediction quality and triggering further rounds when necessary. This agentic loop progressively improves prediction quality by leveraging both spatial dependencies and inter-variable relationships. We validate GeoSR on tasks ranging from physical-world property estimation to socioeconomic prediction. Experimental results show consistent improvements over standard prompting strategies, demonstrating that incorporating geostatistical priors and spatially structured reasoning into LLMs leads to more accurate and equitable geospatial predictions. The code of GeoSR is available at https://github.com/JinfanTang/GeoSR.
- Asia > China > Sichuan Province > Chengdu (0.05)
- North America > United States > California > San Diego County > La Jolla (0.04)
- Africa > Sub-Saharan Africa (0.04)
AggTruth: Contextual Hallucination Detection using Aggregated Attention Scores in LLMs
Matys, Piotr, Eliasz, Jan, Kiełczyński, Konrad, Langner, Mikołaj, Ferdinan, Teddy, Kocoń, Jan, Kazienko, Przemysław
In real-world applications, Large Language Models (LLMs) often hallucinate, even in Retrieval-Augmented Generation (RAG) settings, which poses a significant challenge to their deployment. In this paper, we introduce AggTruth, a method for online detection of contextual hallucinations by analyzing the distribution of internal attention scores in the provided context (passage). Specifically, we propose four different variants of the method, each varying in the aggregation technique used to calculate attention scores. Across all LLMs examined, AggTruth demonstrated stable performance in both same-task and cross-task setups, outperforming the current SOTA in multiple scenarios. Furthermore, we conducted an in-depth analysis of feature selection techniques and examined how the number of selected attention heads impacts detection performance, demonstrating that careful selection of heads is essential to achieve optimal results.
- Research Report > Experimental Study (0.47)
- Research Report > New Finding (0.47)
Accurate and Uncertainty-Aware Multi-Task Prediction of HEA Properties Using Prior-Guided Deep Gaussian Processes
Alvi, Sk Md Ahnaf Akif, Mulukutla, Mrinalini, Flores, Nicolas, Khatamsaz, Danial, Janssen, Jan, Perez, Danny, Allaire, Douglas, Attari, Vahid, Arroyave, Raymundo
Surrogate modeling techniques have become indispensable in accelerating the discovery and optimization of high-entropy alloys (HEAs), especially when integrating computational predictions with sparse experimental observations. This study systematically evaluates the fitting performance of four prominent surrogate models: conventional Gaussian Processes (cGP), Deep Gaussian Processes (DGP), encoder-decoder neural networks for multi-output regression, and XGBoost, all applied to a hybrid dataset of experimental and computational properties in the AlCoCrCuFeMnNiV HEA system. We specifically assess their capabilities in predicting correlated material properties, including yield strength, hardness, modulus, ultimate tensile strength, elongation, and average hardness under dynamic and quasi-static conditions, alongside auxiliary computational properties. The comparison highlights the strengths of hierarchical and deep modeling approaches in handling heteroscedastic, heterotopic, and incomplete data commonly encountered in materials informatics. Our findings illustrate that DGPs infused with a machine-learning-based prior outperform the other surrogates by effectively capturing inter-property correlations and input-dependent uncertainty. This enhanced predictive accuracy positions advanced surrogate models as powerful tools for robust and data-efficient materials design.
- North America > United States > Texas > Brazos County > College Station (0.14)
- North America > United States > New Mexico > Los Alamos County > Los Alamos (0.05)
- Europe > Germany (0.04)
- Asia > Middle East > Republic of Türkiye > Karaman Province > Karaman (0.04)
- Energy (1.00)
- Government > Regional Government > North America Government > United States Government (0.46)
- Government > Military (0.46)
Foundation for unbiased cross-validation of spatio-temporal models for species distribution modeling
Koldasbayeva, Diana, Zaytsev, Alexey
Species Distribution Models (SDMs) often suffer from spatial autocorrelation (SAC), leading to biased performance estimates. We tested four cross-validation (CV) strategies, namely random splits, spatial blocking with varied distances, environmental (ENV) clustering, and a novel spatio-temporal method, under two proposed training schemes: LAST FOLD, widely used in spatial CV at the cost of data loss, and RETRAIN, which maximizes data usage but risks reintroducing SAC. LAST FOLD consistently yielded lower errors and stronger correlations. Spatial blocking at an optimal distance (SP 422) and ENV performed best, achieving Spearman and Pearson correlations of 0.485 and 0.548, respectively, although ENV may be unsuitable for long-term forecasts involving major environmental shifts. The spatio-temporal approach yielded modest benefits in our moderately variable dataset, but may excel with stronger temporal changes. These findings highlight the need to align CV approaches with the spatial and temporal structure of SDM data, ensuring rigorous validation and reliable predictive outcomes.
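As a rough illustration of spatial blocking, the sketch below assigns each observation to a square grid cell and then sends whole cells to folds, so that nearby points tend to land in the same fold. It is not the authors' procedure. The function name `spatial_block_folds`, the square-grid scheme, and the random cell-to-fold assignment are assumptions, and the sketch does not enforce any buffer distance between training and test cells.

```python
import numpy as np

def spatial_block_folds(coords, block_size, n_folds, seed=0):
    """Return a fold index per point, constant within each grid cell.

    coords: array of shape (n_points, 2) in any planar unit.
    block_size: side length of a square cell, in the same unit (assumed).
    """
    coords = np.asarray(coords, dtype=float)
    # Integer cell coordinates for every point.
    cells = np.floor(coords / block_size).astype(int)
    # Map each distinct cell to a consecutive block id.
    _, block_id = np.unique(cells, axis=0, return_inverse=True)
    block_id = block_id.ravel()
    n_blocks = int(block_id.max()) + 1
    # Shuffle blocks, then deal them out to folds round-robin.
    rng = np.random.default_rng(seed)
    block_to_fold = rng.permutation(n_blocks) % n_folds
    return block_to_fold[block_id]
```

Each fold can then serve as the held-out set in turn. Varying `block_size` corresponds loosely to the "varied distances" mentioned in the abstract.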
- Europe > Austria > Vienna (0.14)
- Europe > Norway (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Analyzing and Evaluating Correlation Measures in NLG Meta-Evaluation
Gao, Mingqi, Hu, Xinyu, Lin, Li, Wan, Xiaojun
The correlation between NLG automatic evaluation metrics and human evaluation is often regarded as a critical criterion for assessing the capability of an evaluation metric. However, different grouping methods and correlation coefficients result in various types of correlation measures used in meta-evaluation. In specific evaluation scenarios, prior work often directly follows conventional measure settings, but the characteristics of these measures and the differences between them have not received sufficient attention. Therefore, this paper analyzes 12 common correlation measures using a large amount of real-world data from six widely used NLG evaluation datasets and 32 evaluation metrics, revealing that different measures indeed impact the meta-evaluation results. Furthermore, we propose three perspectives that reflect the capability of meta-evaluation and find that the measure using global grouping and Pearson correlation exhibits the best overall performance in terms of discriminative power, ranking consistency, and sensitivity to score granularity.
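The effect of the grouping method can be seen in a small sketch. It assumes scores are arranged as a matrix with one row per input and one column per system, that "global grouping" means pooling every score into a single correlation, and that an input-level alternative averages per-row correlations. These readings, and the function names, are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def pearson(x, y):
    # Plain Pearson correlation; returns NaN if either vector is constant.
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    denom = np.sqrt((xc ** 2).sum() * (yc ** 2).sum())
    return float((xc * yc).sum() / denom) if denom > 0 else float("nan")

def global_pearson(metric_scores, human_scores):
    # Pool all (input, system) pairs and compute one correlation.
    return pearson(np.ravel(metric_scores), np.ravel(human_scores))

def input_level_pearson(metric_scores, human_scores):
    # Correlate across systems within each input, then average over inputs.
    rows = zip(np.asarray(metric_scores), np.asarray(human_scores))
    return float(np.mean([pearson(m, h) for m, h in rows]))

# Two inputs, three systems: the metric orders systems wrongly within
# each input but still separates the easy input from the hard one.
metric = [[1, 2, 3], [4, 5, 6]]
human = [[3, 2, 1], [6, 5, 4]]
print(global_pearson(metric, human))       # positive
print(input_level_pearson(metric, human))  # negative
```

On this toy data the two groupings disagree in sign, which is the kind of divergence that makes the choice of measure matter.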
- Asia > Singapore (0.04)
- North America > United States > New York (0.04)
- North America > Canada > Ontario > Toronto (0.04)